Glassdoor Germany Data Exploration

Introduction

The project is about wrangling and exploring a Glassdoor dataset to gain insights about the German job market in the field of data science and related disciplines. In particular the job listings with focus on the relationship of job titles, locations, number of employees and industry will be investigated. Additionally, reported salaries for job titles of the job listing employers will be examined. This notebook covers two parts. The first part are data wrangling techniques that are used to obtain the data for the exploratory analyses. This analyses represents the second part and will focus on the use of data visualizations techniques.

Motivation

I live in Munich, Germany and enrolled Udacity's Data Analyst Nanodegree program to deepen my knowledge of data analysis. This project was submitted as part of this program in order to demonstrate the importance and value of exploratory and explanatory data visualization techniques in the data analysis process. I choose to select not any of the given Udacity datasets, because I wanted to search and find a dataset focusing on the German job market in the field of data science and related disciplines for this project. Besides the data visualization tasks, I wanted to train my data wrangling techniques, too.

Dataset

The examination done within this notebook relies on the dataset provided by André Sionek on Kaggle. The information provided on Kaggle and the GitHub repository states how the data was scraped and prepared. The data was scraped on the 10th of December 2019 from Glassdoor.

Table of Contents

  1. Preliminary Data Wrangling
    1.1. Gathering Data
    1.2. Assessing Data
    1.2.1. Tidiness
    1.2.2. Quality
    1.3. Cleaning Data
    1.3.1. Missing Data
    1.3.2. Tidiness 1
    1.3.3. Quality 1
    1.3.4. Tidiness 2
    1.3.5. Quality 2
    1.4. Storing Data
  2. Exploratory Analysis using Visualizations
    2.1. Dataset Structure
    2.2. Univariate Exploration
    2.2.1. Job Titles
    2.2.2. Salaries
    2.2.3. Summary
    2.3. Bivariate Exploration
    2.3.1. Job Titles
    2.3.2. Salaries
    2.3.3. Summary
    2.4. Multivariate Exploration
    2.4.1. Job Titles
    2.4.2. Salaries
    2.4.3. Summary

1. Preliminary Data Wrangling

1.1. Gathering Data

The data will be gathered from 3 different data sources:

  1. glassdoor.csv.gzcontains detailed job information for each listed job on the first 30 pages returned by searching each of the following search terms for every country on Glassdoor:

    • data-scientist
    • software-engineer
    • data-analyst
    • research-scientist
    • business-analyst
    • product-manager
    • project-manager
    • data-engineer
    • statistician
    • dba
    • database-engineer
    • machine-learning-engineer

      The job information contains i.e. the job title, location, number of employees and references to the following salaries and currency exchange datasets.

  2. For each employer of a listed job the fetched job information contained data about base salaries reported on Glassdoor. This data is provided as separate dataset glassdoor_salary_salaries.csv.gz. Reported salaries are grouped by job title and an aggregated low, average and high base salary is stated.

  3. The dataset currency_exchange.csv contains currency exchange rates for the date the data was scraped.

The job information in dataset 1. and 2. was returned by the Glassdoor API endpoint https://www.glassdoor.co.uk/Job/json/details.htm called using the query string parameter jobListingId with each job id found by the search.

1.2. Assessing Data

After gathering each of the above pieces of data, the data will be assessed for tidiness and quality issues. The features will be described. The detected issues will be documented. Only features and issues that satisfy the project motivation and are required to investigate the relationship of job titles, locations, number of employees and reported salaries will be covered.

Let's deep dive into the glassdoor dataset assessment

The type for the first value, the count of unique values, the count of values that equal 0 and the count of missing values for each column is printed below.

Print all columns with values of record with index 1 as example, because this record seems to have salaries data.

Create new temporary dataframe used to continue asssessment just for jobs listed in Germany.

Continue assessment with selected features only that seem to be relevant for investigation and need further verification, i.e. 3 features seem to represent a job title.

Let's focus on assessment of location related columns

The columns header.location and map.location are equal and one of them seems to be the best choice, because gaTrackerData.location differs and contains NaN values.

Let's focus on assessment of locationType related columns

The columns header.locationType and gaTrackerData.locationType are not equal. The column header.locationType would be the better choice, because it idnidcates when a location is not a city. Anyway, the column is not required for further investigation. Only locations which are cities will be used for investigation.

Let's focus on assessment of employer related columns

The columns header.employerName, map.employerName and gaTrackerData.empName are not equal. The column header.employerName seems to be the best choice, because it contains the most values that are not NaN values. The only NaN values are related to some Freelancer listings only having a description and location in place. Anyways, the colum header.employerName contains almost 2000 unique values, which would lead to huge additional cleaning efforts not necessarily required for this investigation. That's why none of the colums related to employer will be kept.

Let's focus on assessment of job_title related columns

The columns header.jobTitle and gaTrackerData.jobTitle are equal. All values in the column header.normalizedJobTitle are different compared to the column header.jobTitle. Based on the assessment, the column header.employerName seems to be the best choice, because it contains the most values that are not NaN values and ca. 90% of the original jobs titles could be summarized to 12 general job titles.

Let's focus on assessment of industry related columns

The columns overview.industry and gaTrackerData.industry are not equal, but the differences are minor, i.e. 'Shops' vs. 'Stores'. The count of unique values for both columns is equal. Let's choose the column overview.industry. Furthermore, it seems like there are no cleaning efforts required. Let's keep the column overview.industry although it would it is not necessarily required for this investigation.

Let's focus on assessment of number of employees related columns

The columns overview.size and gaTrackerData.empSize are not equal, but the differences are minor, i.e. '10000+ employees' vs. '10000--1'. The count of unique values for both columns is equal. Let's choose the column overview.size, because the values are more descriptive.

Let's focus on assessment of posted date

The column header.posted will be kept

Let's focus on assessment of job.description

The column job.description contains free text and there exists not any NaN value. It would be possible to search for certain responsibilities or requirements, e.g. design of data pipelines or programming language skills. Anyways, more than 3000 jobs descriptions are unique, which would lead to huge additional cleaning efforts not necessarily required for this investigation. That's why the column job.description will not be kept.

Let's focus on assessment of salary.salaries

The column salary.salaries will be kept. 4159 records of reported salaries exists. The column represents an id to join the glassdoor data with the salaries data.

Let's focus on assessment of currency_code related columns

The column salary.currency.currencyCode will be kept. The column will be used to get the exchange rates required to finally to convert all salaries to EUR.

Continue assessment with features only that have been verified and selected as relevant for investigation. Additionally rename the columns to increase usability.

Continue assessment with focus on salaries table by temporarily merging the salaries data for job listing record in jobs_germany table. Additionally rename the columns to increase usability.

The table salaries contains mutlipe entries with the same id, because an employer on Glassdoor can have job titles for which salaries have been reported. That's why the new table salaries_germany now contains duplicate compared to the original table jobs_germany that was used as base for the merge. Hence, for the job listings and the salaries investigation separate datasets will be required.

Non of the reported salaries not NaN is duplicated (considering job title, pay period and id).

The investigation will only focus on 'ANNUAL' pay_period, because this covers more than 80% of the non NaN data.

The feature salary_type is not need, because almost all records have the value 'employer'.

It seems that records with pay_percentile_50 values that are NaN, have no median / percentile 50 calculated when less than 3 salaries per job title have been reported.

An annual salary below 100 as lower average (percentile 10) seems to be fake, but job title won't be covered in analysis.

An annual salary above 150000 as lower average (percentile 10) seems to be correct.

For 2 records no job title related to the salary has been reported.

Let's assess, if required exchange rates for converting all reported salaries to EUR exist

The exchange rates from EUR to USD and GBP to USD exist. This is sufficient for conversion of all salaries to EUR.

1.2.1. Tidiness

1.2.2. Quality

glassdoor table

salaries table

currency table

1.3. Cleaning Data

Each of the quality and tidiness issues documented will be cleaned as follows. Initially tidiness followed by quality issues will be cleaned. Afterwards, tidiness and quality issues are cleaned again. Cleaning some quality issues before finalizing tidiness allows to avoid duplicate cleaning effort. The results of cleaning will be 2 high quality and tidy master pandas DataFrames glassdoor_clean and salaries_clean, which will be required to satisfy the project motivation. For each issue the cleaning process is structures by define, code, and test steps which are documented.

1.3.1. Tidiness 1

The table glassdoor contains records of jobs around the world, but only records for Germany will be investigated

Define
Select only records that contain any term that specifies the records belongs to Germany in column map.country.

Code

Test

The table glassdoor consists of 163 columns and just a few of them are relevant features for the investigation

Define
Based on assessment and the investigation scope, just the following features will be considered for exploratory data visualization:

The colum header.employerName will not be kept. It contains almost 2000 unique values, which would lead to huge additional cleaning efforts not necessarily required for this investigation. The employer relationship between reported job titles and salaries will not be investigated.

The column job.description contains free text and there exists not any NaN value. It would be possible to search for certain responsibilities or requirements, e.g. design of data pipelines or programming language skills. Anyways, more than 3000 jobs descriptions are unique, which would lead to huge additional cleaning efforts not necessarily required for this investigation. That's why the column job.description will not be kept.**

Code

Test

The column names of the tables glassdoor and salaries contain . character and should be renamed to increase usability and unambiguity

Define
Remove . characters. Use _ character as word separator, if required. Use lower case only. Use short and descriptive terms.

Code

Test

1.3.3. Quality 1

Some columns will be kept in final cleaned job listings and salaries table. In order to avoid cleaing the quality issues of those columns twice, start cleaning them now and afterwards merge glassdoor with salaries dataset and override salaries with result.

Some location records contain , character followed by country or state, which will be removed to avoid ambiguity

Define
Define a mask for to identify all records for which location contains a , character. Split such location values using the , character as separator. Finally, replace the original location values with the value on the left side of comma without any leading and trailing spaces.

Code

Test

For some location records english translation is used, umlauts are missing or duplicates exist

Define
Replace 'DE' and 'Germany' with 'Deutschland'. Replace 'Nuremberg' with 'Nürnberg'. Replace 'Cologne' and 'Koln' with 'Köln'. Replace 'Cologne' and 'Munich' with 'München'. Replace 'Heidenheim' with 'Heidenheim an der Brenz'. Replace 'Lindau (Bodensee)' with 'Lindau'.

Code

Test

Some locations are not a city or not in Germany, i.e. districts, states in or not in Germany, cities not in Germany or country notations

Define
Replace locations that are not cities in Germany with nan.

Code

Test

The n_employees should be a categorical data type and Unknown should be replaced by NaN

Define
Replace Unknown with NaN. Create a CategoricalDtype inclusive an order. Change the dtype to the previously created CategoricalDtype.

Code

Test

1.3.4. Tidiness 2

For the salary investigation, it's required to identify the salary for German employers only

Define
Merge the glassdoor_clean with salaries_clean table and override salaries_clean with results (after cleaning quality issues in glassdoor_clean columns that will be kept in final salaries_clean table, see here). The new table salaries_clean contains mutlipe entries with the same id, because an employer on Glassdoor can have mutliple job titles for which salaries have been reported. That's why the table salaries_clean after the merge contains duplicates compared to the original table glassdoor_clean that was used as base for the merge. Hence, for the job listings and the salaries investigation separate datasets will be required.

Code

Test

Add the exchange rates required for converting the reported salaries to EUR

Define

Extract the exchange rate required to convert salaries to EUR from table currency_clean. Afterwards, merge salaries_clean with the extracted currency exchange rates. Code

Test

Drop columns not required for salaries investigation

Define
Let's drop the following features, because they are not required for the salary investigation as reasoned below.

Code

Test

Drop columns not required for job listings investigation

Define
Let's drop the following features, because they are not required for the job listings investigation as reasoned below.

Code

Test

1.3.5. Quality 2

Job posted date should be datetime data type

Define
Convert the data type for column posted to datetime64.

Code

Test

Define
Drop all records were base_pay_count, salaries_job_title, pay_period, pay_percentile_10, pay_percentile_50, pay_percentile_90, currency_code and exchange_rate is null Furthermore, drop the 2 records were a salary without a job title has been reported.

Code

Test

Convert all reported salaries to EUR using the currency exchange rates

Define
Get the EUR to USD exchange rate. Now multiple all reported salary percentile values with the repsective exchange rate divided by the EUR to USD exchange rate. Update the currency_code to EUR and drop the exchange_rate column.

Code

Test

There are many unique original job_title values, which need to be normalized / summarized to maximum 10 general job titles covering more than 80% of the records

Define
As described in gathering data, the job listings are based on search terms like data-scientist, software-engineer and others. Anyway, there appear more than 2000 unique job titles. This requires a normalization to more general titles. For visualization purposes, not more than 10 unique titles but covering 80% of the data would be ideal. Implementation approach: Use defined regex terms to search for patterns to be assigned to a general job title. Create a new categorical column normalized_job_title that represents those general job titles.

Code

Test

There are many unique original salaries_job_title values, which need to be normalized / summarized to the values determined based on Job listings

Define
The previously defined and normalized job titles for the job listings (see normalized_job_title in table glassdoor_clean) will be used as category-assignment for salaries_job_title, too. The categories will be stored in the categorical normalized_job_title column. The column salaries_job_title has 620 unique job titles for which salaries have been reported. The number of records assigned to a category will be only ~21% percentage of the total number of reported salaries and ~28% percentage of the total number of job listing records assigned to those categories.

Code

Test

The column pay_period should be a categorical data type

Define
Create a CategoricalDtype with an order. Change the dtype to the previously created CategoricalDtype.

Code

Test

More than have reported salaries have no percentile 50 value

Define
There records that have a NaN value for pay_percentile_50 and pay_percentile_10 as well as pay_percentile_90 is given. Calculate an estimated percentile 50 value for those records by dividing the sum of pay_percentile_10 and pay_percentile_90 by 2.

Code

Test

1.4. Storing Data

2. Exploratory Analysis using Visualizations

2.1. Dataset structure

What is the structure of your dataset?

There are 4,604 job listings for Germany with 6 features in the glassdoor_clean dataset. All features, besides posted are qualitative variables. There is no quantative variable. The features are:

All employer that are a subset of the job listings glassdoor_clean dataset, which have reported job titles inclusive salaries on Glassdoor are part of the salaries_clean dataset. The dataset contains 5,610 reported job titles and their salaries. It consists of the following 11 quantative and qualitative features:

What are the main features of interest in your dataset?

To gain insights about the German job market in the field of data sciene I am particular interested in the relationship of job titles with locations, number of employees and industry. That's why the features of interest for the job listings (glassdoor_clean dataset) are location, industry, n_employees, posted and normalized_job_title. The feature job_title has too many unique values and is not suitable for visualization.

Additionally, it would be interesting to see the base salaries for the job titles out of the listed jobs. Therefore, reported salaries for job titles of the job listing employers should be examined. The main features of interest of the salaries_clean dataset are industry, n_employees, pay_percentile_10, pay_percentile_50, pay_percentile_90 and normalized_job_title.

What features in the dataset do you think will help support your investigation into your features of interest?

For the job listings (glassdoor_clean dataset) exists no further feature, which will help support the investigation.

The features base_pay_count andpay_period will likely help to support the investigation for the reported job titles inclusive salaries (salaries_clean dataset). The other features are likely not required.

2.2. Univariate Exploration

The distributions of the individual main variables of interest and other variables that might support the investigation will be investigated. Any unusual points, distributions, required transformations or operations on the data to tidy, adjust, or change the form of the data will be described.

2.2.1. Job Titles

Let's start with the univariate exploration of the job listings (glassdoor_clean dataset). Initially, the share of job listings extracted and cleaned for Germany and shares of records that are not null for the main features of interest will be visualized.

The pie chart above illustrates the share of job listings records for Germany that have been extracted and cleaned from the original dataset.

95.4% of the job listings have known values for the city (location). For 89.7% of the job listings a normalized job title (normalized_job_titles) has been determined. The number of employees (n_employees) is known for 80.0%. 63.9% have a industry (industry) given. Only 53.6% of the records have no null values for any feature.

The histogram above shows the distribution of posted dates. Due to some outliers the data looks very left skewed. Less than 3% (123 of 4604 records) of the data has a posted date before the 11th of October 2019. Let's ignore the outliers and zoom in. There is no need to drop the outliers.

The posted date distribution above unveils that 97% of the job listings has been posted between 2019-10-11 and 2019-11-10. It's a multimodal distribution.

The countplot above shows the distribution of 89.75% of the job listings which have a normalized_job_title. The distribution is sorted from high to low frequency from left ot right. Out of this 89.75 %, for roughly 40% of the job listings the normalized job titles are Software Engineer and Data engineer. The number of job listings for Software Engineers is slightly larger than for Data engineers. Project Manager, Data Scientist and Product Manager have a roughly similar job listing frequencies (+-50 counts), which represents in sum approx. 35% of the listings. The next higher frequencies are job listings for Business Analyst and Data Analyst. Those sum up to roughly 16%. The remaining 9% are Researcher, Product Owner, Machine Learning Engineer, Engineering Manager and Scrum Master job listings. The Engineering Manager and Scrum Master job listing frequencies are only covering 10 records and are outliers. For further investigations they will not be considered. The normalized job titles do not have information about the job level, which means there is no differentiation between interns, entries or seniors.

The distributions of the sorted categorical variable n_employee is illustrated as countplot above. For roughly 80% of the job listings records the company size known. Those job listings are well distributed among all company sizes. Companies with more than 10000+ employees provide the most job listings (25%). The two smallest companies sizes have the next highest shares with each almost 20%. The second largest companies with 5001 to 10000 employees provide the lowest number of job listings. The remaining company sizes have a frequency in between.

The location is a city. It is known for 96% of the job listings. Since, the dataset contains 340 unique locations, only the Top 10 locations sorted by highest frequency have been plotted. The top 10 locations are representing 65% (2840 records) of the job listings that have a location provided. 55% (2433) of the records that have a location are within the top 5 locations. Berlin has the highest frequency with more than 25%. This is more than twice of the job listings for the location München (Munich) with roughly 12%. The third top location Frankfurt am Main, the forth Hamburg and fifth Köln (Cologne) have roughly half as much of listings as München. Other top 10 cities for job listings are Düsseldorf, Stuttgart, Karlsruhe, Dresden and Hannover.

For 64% of the job listings the employers industry is known. Since, the dataset contains 91 unique industries, only the Top 10 industries sorted by highest frequency have been plotted. The top 10 industries are representing 60% (1780 records) of the job listings that have an industry provided. 47% (1383) of the records that have an indsutry are within the top 5 industries.

The Internet industry has the highest frequency with more than 16%. The IT Services industry covers 11%, the Computer Hardware & Software industry 7% and Enterprise Software & Network Solutions industry covers roughly 6.5%. The fifth top industry is Consulting with roughly 6.5%, too. It's remarkable that the top 4 industries are IT-related and sum up to a share of approx. 40.5%.

Other top 10 industries can be taken from the graphic above.

2.2.2. Salaries

After the univariate exploration of the job listings (glassdoor_clean dataset), the project continues with the reported job title salaries (salaries_clean dataset). Initially, the shares of records that are not null for the main features of interest will be visualized.

For only 21.2% of the reported job title salaries a normalized job title (normalized_job_titles) has been determined. The normalized job titles are the same unique values as for job listings. The number of employees (n_employees) is known for 99.4%. 97.5% have no null values for industry (industry). Only 21.0% of the records have no null values for any feature.

The histogram above shows the distribution of the numeric base_pay_count variable. Due to some outliers the data looks very right skewed with a long tail. Let's apply a log transformation.

Each record has a value for base_pay_count. After applying the log transformation, a strong right skewed unimodal distribution is precisely visible. The highest counts are for 1 and 2 values. Those 1 and 2 values sum up to more than 50%, which means that just 1 or 2 salary per job title and pay period has been reported for more than 50% of the records. Another roughly 25% of the records has more than 3 but not more than 10 reported salaries. Less than roughly 5% of the records have more than 50 salaries reported. The remaining 20% have more than 10 but less than 50 reported salaries. It's seems to be worth to use this variable in further investigations, e.g., if the sum of base_pay_count for records grouped by normalized_job_title and pay_period is high, then the significance of the low, median and high base pay salaries is high.

The distribution of the sorted categorical pay_period variable is displayed as histogram above. There exist no null values. 86% (4826 records) of the salaries reported represent the annual base pay salary. Monthly reported salaries are roughly 5.5% (313) of the records. The remaining 8.5% (471) are hourly reported salaries.

The further analysis will only focus on annually reported salaries.

The histogram above shows the distribution of pay_percentile_10, pay_percentile_50 and pay_percentile_90 for annual reported salaries in a single histogram. For those 3 features no null values exist. The illustration does not replace the need to plot each numeric variable in a dedicated histogram to increase the explanatory power and readability. Still, it shows that all 3 variables seem to be right skewed and have a long tail. This means that lower salaries are more often reported than higher salaries. Let's plot all features in a dedicated histogram using a log transformation scale.

The right skewed distribution was confirmed by appyling a log transformation for pay_percentile_10 and plotting the mean and median in the histogram above. The top of the reported low base salary is around the median of 30k€. The interquartile range (IQR) starts roughly 7k€ (23k€) below and end 11k€ (41k€) above the median.

The right skewed distribution was confirmed by appyling a log transformation for pay_percentile_50 and plotting the mean and median in the histogram above. The top of the reported average base salary is slightly less than the median of 34k€. The interquartile range (IQR) starts roughly 10k€ (24k€) below and ends 17k€ (51k€) above the median.

The right skewed distribution was confirmed by appyling a log transformation for pay_percentile_90 and plotting the mean and median in the histogram above. The top of the reported average base salary is roughly at 33k€, which is less than median of 39k€. The interquartile range (IQR) starts roughly 12k€ (27k€) below and ends 23k€ (62k€) above the median.

The histogram above shows the distribution of pay_percentile_10, pay_percentile_50 and pay_percentile_90 for annual reported salaries in a single histogram. The feature bars do not overlay and a log transformation was applied. Additionally, the medians and means for all features are plotted.

Overall, the reported base pay salaries cover a large range from very low to very high reported salaries. There are many outliers. The lower or higher the reported base pays are, the fewer reports are available. It will be interesting to see the relationship with normalized job titles even without having the information about the job level.

The countplot above shows the distribution of 21.16% (1187) of the records that have a normalized_job_title. The distribution is sorted from high to low frequency from left ot right. Out of this 21.16%, for roughly 49% of the reported salaries the normalized job titles is Software Engineer. This is more than twice of the second largest frequency share representing Project Manager with approx. 19%. Business Analyst have the third largest frequency share with approx. 13%. Just a slightly different share of roughly 8% have Researcher and Product Owner. The top 5 cover 97% of the frequency share. The remaining 3% share are covered by Engineering Manager, Data Analyst, Data Scientist, Data Engineer, Machine Learning Engineer and Product Owner. That's why for further reported salary investigations only the top 5 will be considered.

For the normalized job titles determined for job listings are less frequent reported salaries present. Especially, the job titles Data Analyst, Data Scientist and Data Engineer had high frequencies for job listings, but almost have no salary reports.

Additionally, we have no information about the job level for reported salaries, which means there is no differentiation between interns, entries or seniors.

The distributions of the sorted categorical variable n_employee is illustrated as countplot above. For roughly 99% of the reported salary records the company size known. Especially for large companies with more than 10000+ employees, reported salaries are provided (67%). The second largest companies with 5001 to 10000 employees have a frequency share of 7%, which is similar to companies with 501 to 1000 employees. Companies 1001 to 5000 employees have the second largest frequency share with 14%. The remaining small companies only represent 5% of the frequency share.

For 97.54% of the reported salaries records the employers industry is known. Since, the dataset contains 65 unique industries, only the top 10 industries sorted by highest frequency have been plotted. The top 10 industries are representing approx. 67% of the reported salaries that have an industry provided. Roughly 52% of the records that have an industry are within the top 5 industries.

The Internet industry has the highest frequency with approx. 25%. The IT Services industry covers 8%, the Consulting industry 7%, Biotech & Pharmaceuticals industry 7% and Computer Hardware & Software industry 6%. Those 4 industries have a comparable share that just slightly differs. This is true for the remaining top 10 industries, which can be taken from the graphic above. The share of each is slighly above or below of 3%.

It's remarkable that the top 10 contain 4 IT-related industries, which sum up to a share of approx. 43%.

2.2.3. Summary

Job listings (glassdoor_clean dataset)

53.6% of the records are complete. that means, they have no null values for the features posted, location, normalized_job_title, n_employees and industry.

97% of the job listings have been posted between 2019-10-11 and 2019-11-10. Based on the fact that the data was gathered on the 10th of December 2019, most listings are not older than 2 months. The remaining listings, especially the ones from 2018 and earlier 2019 might be outdated or a kind of permanent listings.

For 89.75% of the job listings a normalized job title could be assigned. The titles with the highest frequency are Software Engineer and Data engineer. Both represent already 40% out of the 89.75% of the listings. Other relevant titles that should be considered in further investigation are Project Manager, Data Scientist, Product Manager, Business Analyst, Data Analyst, Researcher, Product Owner and Machine Learning Engineer. The Engineering Manager and Scrum Master job listing frequencies are only covering 10 records and are outliers. For further investigations they will not be considered. The normalized job titles do not have information about the job level, i.e. there is no differentiation between interns, entries or seniors.

For roughly 80% of the job listings records the company size in terms of number of employees known. Those job listings are well distributed among all company sizes. Companies with more than 10000+ employees provide the most job listings (25%). The two smallest companies sizes have the next highest shares with each almost 20%.

The location (city) is known for 96% of the job listings. The top 10 out of 344 unique locations are representing 65% of the job listings that have a location provided. 55% of the records that have a location are within the top 5 locations. Berlin has the highest frequency with more than 25%. This is more than twice of the job listings for the location München (Munich) with roughly 12%. The third top location Frankfurt am Main, the forth Hamburg and fifth Köln (Cologne) have roughly half as much of listings as München. Other top 10 cities for job listings are Düsseldorf, Stuttgart, Karlsruhe, Dresden and Hannover. The further investigation will focus just on the top 5 cities.

For 64% of the job listings the employer industry is known. The top 10 out of 91 unique industries are representing 60% of the job listings that have an industry provided. 47% of the records that have an indsutry are within the top 5 industries. The Internet industry has the highest frequency with more than 16%. The IT Services industry covers 11%, the Computer Hardware & Software industry 7% and Enterprise Software & Network Solutions industry covers roughly 6.5%. The fifth top industry is Consulting with roughly 6.5%, too. It's remarkable that the top 4 industries are IT-related and sum up to a share of approx. 40.5%. The further investigation will focus just on the top 5 industries.

No changes on the form of data have been applied.

Reported salaries (salaries_clean dataset)

21% of the records are complete. They have no null values for the features base_pay_count, pay_period, pay_percentile_10, pay_percentile_50, pay_percentile_90, normalized_job_title, n_employees and industry.

All records have a value for the numeric variable base_pay_count. A log transformation was applied, because the data is covering a large range with a strong right skewed unimodal distribution. The highest counts are for 1 and 2 values. Those 1 and 2 values sum up to more than 50%, which means that just 1 or 2 salary per job title and pay period has been reported for more than 50% of the records. Another roughly 25% of the records has more than 3 but not more than 10 reported salaries. Less than roughly 5% of the records have more than 50 salaries reported. The remaining 20% have more than 10 but less than 50 reported salaries. It's seems to be worth to use this variable in further investigations, e.g. if the sum of base_pay_count for records grouped by normalized_job_title and pay_period is high, then the significance of the low, median and high base pay salaries is high.

The categorical pay_period has no null values. 86% of the salaries reported represent the annual base pay salary. Monthly reported salaries are roughly 5.5% (313) of the records. The remaining 8.5% (471) are hourly reported salaries. The further analysis will only focus on annually reported salaries.

There are no null values for pay_percentile_10, pay_percentile_50 and pay_percentile_90 for annual reported salaries. Those 3 features have a right skewed distribution and a large data range, which required applying a log transformation.

There are a many outliers representing very low and very high reported base pay salaries. The lower or higher the reported base pays are, the fewer reports are available. It will be interesting to see the relationship with normalized job titles even without having the information about the job level.

21% of the records have a normalized_job_title. Out of this 21% have roughly 49% of the reported salaries the normalized job titles Software Engineer. This is more than twice of the second largest frequency share representing Project Manager with approx. 19%. Business Analyst have the third largest frequency share with ca. 13%. Just a slightly different share of roughly 8% have each Researcher and Product Owner. The top 5 cover 97% of the frequency share. The remaining 3% share consists of Engineering Manager, Data Analyst, Data Scientist, Data Engineer, Machine Learning Engineer and Product Owner. That's why for further reported salary investigations only the top 5 will be considered.

For the normalized job titles determined for job listings are less frequent reported salaries present. Especially, the job titles Data Analyst, Data Scientist and Data Engineer had high frequencies for job listings, but almost have no salary reports. Additionally, we have no information about the job level for reported salaries, which means there is no differentiation between interns, entries or seniors.

For roughly 99% of the reported salary records is the company size known. Especially for large companies with more than 10000+ employees, reported salaries are provided (67%). The second largest companies with 5001 to 10000 employees has a frequency share of 7%, which is similar to companies with 501 to 1000 employees. Companies 1001 to 5000 employees have the second largest frequency share with 14%. The remaining small companies only represent 5% of the frequency share.

For 97.54% of the reported salaries records the employers industry is known. The top 10 out of 65 unique industries are representing approx. 67% of the reported salaries that have a industry provided. Roughly 52% of the records that have a industry are within the top 5 industries. The Internet industry has the highest frequency with approx. 25%. The IT Services industry covers 8%, the Consulting industry 7%, Biotech & Pharmaceuticals industry 7% and Computer Hardware & Software industry 6%. Those 4 industry have a comparable share that just slightly differs. It's remarkable that the top 10 contain 4 IT-related industries, which sum up to a share of approx. 43%. The further investigation will focus just on the top 5 industries.

No changes on the form of data have been applied.

2.3. Bivariate Exploration

Now, investigate the relationships between pairs of variables. The observed relationships and how the features of interest vary with other features will be described. Any interesting relationships between the other features that might support the investigation of the main features of interest will be covered, too.

First the dataset of the job listings (glassdoor_clean dataset) will be explored. Afterwards, the exploration of the reported job title salaries (salaries_clean dataset) follows.

2.3.1. Job Listings

The code below creates the following dedicated subsets of the data:

All features of main interest are qualitative variables. For visualization use countplots and different color encoding for the hue variable. For n_employees variable use the sequential color palette 'Blues', use 'Greens' for normalized_job_title and 'flare' for location.

The countplot above visualizes the relationship between the count of normalized job titles and the number of employees. The number of employees are color encoded.

For almost each normalized job title, large companies with more than 10,000 employees seem to have the largest share of job listings. For Data Engineer it's the second largest share. Only for Product Owner the share of the largest company is clearly smaller.

The smallest companies with less than 201 employees have often the second largest share of job listings. It's eyecatching that small companies count more job listings for Product Owner than large companies.

The relationship between normalized job titles and the number of employees in terms of job listing counts is shown again as heatmap above. The heatmap confirms the observations already made in the previously shown countplot. It was plotted, because on the y-axis of the countplot we have 10 normalized jobs and no count values are displayed. The heatmap annotates all value counts and uses a color encoding that makes large values more eye-catching by a darker blue color.

By focusing on the relationship of location and the number of employees, the countplot above unveils that especially in Berlin, München (Munich) and Frankfurt am Main large companies with more than 10,000 employees have clearly the largest share of job listings. In Frankfurt am Main the remaining share of job listings is almost equally distributed among the other company sizes. The smallest companies with less than 201 employees have clearly the second largest shares of job listings in Berlin and München. In Frankfurt am Main companies with less than 51 employees have the smallest share. The job listings for Hamburg and Köln (Cologne) are well distributed among all company sizes.

Plotting a heatmap in addition is not necessary, because the readability when having just 5 locations is very good.

The relationship between industry and the number of employees is shown in the countplot above. It's eye-catching that the job listings of large companies with more than 10,000 employees have a very high share in the Internet industry. Companies with 1001 to 5000 employees have the second largest share in Internet industry. In IT Services, Computer Hardware & Software and Enterprise Software & Network Solutions have small companies with 51 to 200 employees the largest share. In Consulting industry companies with 501 to 1000 employees have the largest share. In each of the top 5 industries, companies with 5001 to 10000 employees seem to have the smallest share of job listings. In Enterprise Software & Network Solutions industry those companies have no job listings.

The code above creates the following dedicated subsets of the data:

The countplot above displays the relationship between normalized job titles and the location. The normalized job titles are color encoded. Especially in Berlin, but as well in München (Munich) and Köln (Cologne) Software Engineer listings have the highest share. For Frankfurt am Main Business Analysts and for Hamburg Data Scientists are listed most. The second largest shares for all locations, besides Berlin, are Data Engineer listings. In Berlin, it is the third largest share, which is still larger than the largest share of listings of other locations. The second largest share in Berlin by far is Product Manager. This is unusual compared to the large shares of other location. To confirm the observations, let's plot a heatmap below.

The heatmap confirms the observations made with the countplot above.

The relationship of normalized job titles and industry is illustrated above. It's eye-catching that the job listings of Software Engineers have the largest share in Internet, Computer Hardware & Software and Enterprise Software & Network Solutions by far. In IT Services and Consulting industry it's a large share, but not the largest. Instead, Data Engineers are listed most for those industries. The second largest share in Internet industry by far is Product Manager. This is unusual compared to the large shares of other industries.

The countplot above displays the relationship between industry and the location. The industry is color encoded. Especially in the Internet Industry, but as well in IT Services, Computer Hardware & Software and Enterprise Software & Network Solutions job listings in Berlin have the largest share by far. München (Munich) has for those industries the second largest share. Only for Consulting industry München (Munich) has the most listed followed by Berlin. The job listings for Hamburg are focused in IT Services industry. Frankfurt has only a few listings in Internet and Enterprise Software & Network Solutions industry.

Finally, investigate the relationship of other features with the features of main interest.

The relationship between the count of jobs posted per date and the number of employees is illustrated in the scatterplot above. The number of employees are color encoded. The size of the marker depends on the counts of jobs posted per date. The more jobs have been posted per date, the larger the diameter of the marker is. The plot does not unveil any interesting relationships.

The relationship between the count of jobs posted per date and the normalized job titles is illustrated in the scatterplot above. The normalized job titles are color encoded. The size of the marker depends on the counts of jobs posted per date. The more jobs have been posted per date, the larger the diameter of the marker is. The plot does not unveil any interesting relationship.

The relationship between the count of jobs posted per date and the top 5 industries is illustrated in the scatterplot above. The industries are color encoded. The size of the marker depends on the counts of jobs posted per date. The more jobs have been posted per date, the larger the diameter of the marker is. The plot does not unveil any interesting relationships.

The relationship between the count of jobs posted per date and the top 5 locations is illustrated in the scatterplot above. The locations are color encoded. The size of the marker depends on the counts of jobs posted per date. The more jobs have been posted per date, the larger the diameter of the marker is. The plot does not unveil any interesting relationships.

2.3.2. Salaries

Let's explore the reported job title salaries. Based on the original dataset salaries_clean a subset just containing records of salaries with value 'ANNUAL' for pay_period named annual_salaries was created. As stated in the summary of the univariate exploration, the further investigation will only focus on annual_salaries as new origin.

The pairgrid above illustrates all numerical features of interest and other features. The diagonal plots a histogram to show the distribution of each variable as already done in bivariate exploration. All other cells provide a scatterplot of a pair of variables.

The base_pay_count, which equals the number of reported salaries, seem to have no correlation with the variables pay_percentile_10, pay_percentile_50 and pay_percentile_90. This means that there is no positive as well as negative correlation, e.g. a large number of reported salaries, does not lead to a higher salary and vice versa.

As expected, the pay_percentile_10, pay_percentile_50 and pay_percentile_90 have a strong positive correlation.

Let's compute and plot pairwise correlation coefficients.

The heatmap above illustrates the pairwise computed pearson standard correlation coefficients. The same variables as plotted within the scatterplots of the pairgrid are used: base_pay_count, pay_percentile_10, pay_percentile_50 and pay_percentile_90. The pearson standard correlation coefficients confirms the observations made from the scatterplots. In addition, the coefficients is closer to 1 for the pair pay_percentile_10 and pay_percentile_50 as well as the pair pay_percentile_90 and pay_percentile_50 compared to pay_percentile_10 and pay_percentile_90. It is just a slight difference, which should not be overinterpreted.

The code above creates the following dedicated subsets of the data:

For the qualitative features of main interest a different color encoding is used:

The relationship of each of those qualitative features with the numerical features base_pay_count, pay_percentile_10, pay_percentile_50 and pay_percentile_90 will be investigated below.

As visualized in the barplot above, 5 times more salaries have been reported for Software Engineers (67%) compared to the second largest number of reported salaries for Project Manager (13%). For Business Analyst 11%, for Product Manager 5% and 3% for Researcher have been reported. Please note that those reported salary numbers are the sum of the base_pay_count values per job title. It's not the number of records per job title. Each record already contains a base pay salary that was computed by Glassdoor based on the base_pay_count value (number of reported salaries).

The pointplot above shows the average of all Low (Percentile 10), Median (Percentile 50) and High (Percentile 90) salaries provided per normalized job title. Additionally, for each data point an error bar representing the 95% confidence interval (5% error rate) using bootstrapping is shown. A seed is used for reproducible bootstrapping. The points are connected as line to see the slopes.

The relationship between the normalized job titles and salaries unveils that the highest average salaries are reported for Product Managers by far. The next higher average salaries are for Project Managers followed by Software Engineers by only a slight difference. The mean median salary of those jobs titles is approx. at the same level as the low salary of Product Managers. Business Analyst average salaries are lower than Software Engineers. The mean high salary of Business Analysts is approx. at the same level at the mean median average salaries of Software Engineers. Researcher have the by far lowest salaries reported.

Besides Product Manager listings have all other titles a smaller error bar, which indicates that there are more reported salaries. The distance between average low, median and high base salaries is the smallest for Researchers. Product Manager listings have the largest distances. A larger distance indicates that a higher variety of reported salaries exists.

The pointplot only shows the mean. Let's investigate the distribution next.

For the Low (Percentile 10), Median (Percentile 50) and High (Percentile 90) salaries a boxplot is displayed above. Each boxplot illustrates the relationship with the top 5 normalized job titles. Additionally, the mean was added as dashed line as reference to the pointplot explored before. Points lower than quantile 5 and higher than quantile 95 will be plotted outside the whisker and are treated as outliers. A log transformation for all salaries was applied, because of the right skewed distribution and large range of data determined during the univariate exploration.

The observations made based on the pointplot can be confirmed by the boxplots. In addition, it's eye-catching that the IQR distributions (the box, 25-50%) is covering a larger salary range for high salaries compared to median salaries as well as for median salaries compared to low salaries.

The barplot above illustrates that 15 times more salaries have been reported for large companies with 10000+ employees (88.5%) compared to the second largest number of reported salaries for companies with 1001 to 5000 employess (6%). Afterwards the shares of companies with 5001 to 10000 employees with 2.5% and companies with 501 to 1001 employees with 2% follow. The share of reported salaries for the 3 smallest companies sums up to less than 1%. Please note that those reported salary numbers are the sum of the base_pay_count values per comnpany size. It's not the number of records per company size. Each record already contains a base pay salary that was computed by Glassdoor based on the base_pay_count value (number of reported salaries).

The pointplot above shows the average of all Low (Percentile 10), Median (Percentile 50) and High (Percentile 90) salaries provided per company size. Additionally, for each data point an error bar representing the 95% confidence interval (5% error rate) using bootstrapping is shown. A seed is used for reproducible bootstrapping. The points are connected as line to see the slopes.

The relationship between the number of employees and salaries shows that the smallest companies with 1 to 50 employees have the lowest mean base salaries reported. The highest average median salaries are reported for companies with 1001 to 5000 employees, followed very closely by companies with 501 to 1000 employees and 10000+ employees. For the average high salaries reported, companies with 501 to 1000 employees are leading, closely follows by companies with 10000+ employees. The average low and median salary of companies is even higher than the highest, but the mean high salaries are lower. Companies with 1001 to 5000 employees have approx. the same mean high salaries as companies with 5001 to 10000 employees. Other mean salaries can be taken from the graphic.

Besides companies with 10000+ employees all companies have a large error bar, which indicates that there are less reported salaries. The distance between average low, median and high base salaries is the smallest for companies with 1 to 50 employees, followed by companies with 51 to 200 employees. Companies with 10000+ employees, 501 to 1000 employees and 5001 to 100001 employess have the largest distances in this order. Companies in between have a distance in between. Especially the distance between the mean median and mean high salaries varies. A larger distance indicates that a higher variety of reported salaries exists.

Let's investigate the distribution next.

For the Low (Percentile 10), Median (Percentile 50) and High (Percentile 90) salaries a boxplot is displayed above. Each boxplot illustrates the relationship with the company size. Additionally, the mean was added as dashed line as reference to the pointplot explored before. Points lower than quantile 5 and higher than quantile 95 will be plotted outside the whisker and are treated as outliers. A log transformation for all salaries was applied, because of the right skewed distribution and large range of data determined during the univariate exploration.

The observations made based on the pointplot can be confirmed by the boxplots. In addition, it's eye-catching that the IQR distributions (the box, 25-50%) is covering a larger salary range for high salaries compared to median salaries as well as for median salaries compared to low salaries.

The barplot above illustrates that 2.5 times more salaries have been reported for the Internet industry (54%) compared to the second largest number of reported salaries for Consulting industry (20%). The IT Services industry gains 12%, the Computer Hardware & Software industry 9% and the Biotech & Pharmaceuticals 5% share of reported salaries.

Please note that those reported salary numbers are the sum of the base_pay_count values per industry. It's not the number of records per industry. Each record already contains a base pay salary that was computed by Glassdoor based on the base_pay_count value (number of reported salaries).

The pointplot above shows the average of all Low (Percentile 10), Median (Percentile 50) and High (Percentile 90) Salaries provided per industry. Additionally, for each data point an error bar representing the 95% confidence interval (5% error rate) using bootstrapping is shown. A seed is used for reproducible bootstrapping. The points are connected as line to see the slopes.

The relationship between the industries and salaries unveils that the Consulting industry have the highest mean base salaries reported. In IT Services industry the second highest salaries are reported. The Computer Hardware & Software and Internet industry follows. The lowest salaries are reported for Biotech & Pharmaceuticals industry.

Besides the Internet industry have all other industries a large error bar, which indicates that there are less reported salaries. The distance between average low, median and high base salaries is the smallest in Biotech & Pharmaceuticals and the largest is in Consulting industry. Especially the distance between the mean median and mean high salaries varies. A larger distance indicates that a higher variety of reported salaries exists.

Let's investigate the distribution next.

For the Low (Percentile 10), Median (Percentile 50) and High (Percentile 90) Salaries a boxplot is displayed above. Each boxplot illustrates the relationship with the top 5 industries. Additionally, the mean was added as dashed line as reference to the pointplot explored before. Points lower than quantile 5 and higher than quantile 95 will be plotted outside the whisker and are treated as outliers. A log transformation for all salaries was applied, because of the right skewed distribution and large range of data determined during the univariate exploration.

The observations made based on the pointplot can be confirmed by the boxplots. In addition, it's eye-catching that the IQR distributions (the box, 25-50%) is covering a larger salary range for high salaries compared to median salaries as well as for median salaries compared to low salaries.

2.3.3. Summary

Job listings (glassdoor_clean dataset)

The relationships of the features of main interest (qualitative variables only) have been investigated.

Following observations for the relationships with number of employees were determined:

Examining the relationships of the top 5 normalized job titles lead to the following observations:

The remaining relationships between the top 5 industries and the top 5 locations revealed that especially in the Internet Industry, but as well in IT Services, Computer Hardware & Software and Enterprise Software & Network Solutions job listings in Berlin have the largest share by far. München (Munich) has for those industries the second largest share. Only for Consulting industry München (Munich) has the most listed followed by Berlin. The job listings for Hamburg are focused in IT Services industry. Frankfurt has only a few listings in Internet and Enterprise Software & Network Solutions industry.

The investigation of relationships of other features with main features of interest did not lead to any insights. There is no interesting relationship between the count of jobs posted per date and

Reported salaries (salaries_clean dataset)

The relationship only for reported salaries with the an annual pay period has been examined.

The investigation of the correlation of all numerical features of main interest and other features lead to the observation that the Low Base Pay Salary pay_percentile_10, the Median Base Pay Salary pay_percentile_50 and the High Base Pay Salary pay_percentile_90 have a strong positive correlation. This was expected, because each record already contains a base pay salary Low, Median and High salary that was computed by Glassdoor based on the base_pay_count value. The base_pay_count, which equals the number of reported salaries per record, seems to have no correlation with the variables pay_percentile_10, pay_percentile_50 and pay_percentile_90. This means that there is no positive as well as negative correlation, e.g., a large number of reported salaries, does not lead to a higher salary and vice versa. Those obersavtions were confirmed by pairwise computing the pearson standard correlation coefficients.

For the qualitative features of main interest the relationship with the numerical features base_pay_count, pay_percentile_10, pay_percentile_50 and pay_percentile_90 has been explored.

Examining the relationships of the top 5 normalized job titles for which annual salaries have been reported lead to the following observations:

Following observations for the relationships with number of employees were determined:

The investigation of relationships of the top 5 industries revealed following:

2.4. Multivariate Exploration

Finally, follow-up the pairwise explored relationships by investigating the relationships between three or more variables. The observed relationships and how the features strengthens each other will be described. Any interesting or suprising interactions will be covered, too.

First the dataset of the job listings (glassdoor_clean dataset) will be explored. Afterwards, the exploration of the reported job title salaries (salaries_clean dataset) follows.

2.4.1. Job Listings

The features of main interest for job listings are only qualitative variables. That's why we have to compute a translation of some of the qualitative variables to a numerical variable (similar to an index). This will allow us to create heatmaps leveraging two qualitative and one numerical feature. In order to decide, what numerical value will be assigned based on the relationship of the two qualitative features, the highest frequency count of the third qualitative, the to be translated variable, will be used.

The above code shows the creation of an dictionary that serves as translation between the qualitative (categorical) and quantative (numerical) values for company sizes. An additional category named 'Multiple Company Sizes' was added to cover cases were the highest frequency count of company size based on the relationship of the two other qualitative features, locations and normalized job titles, is shared among multiple company sizes and is not 0.

The above code shows the creation of an dictionary that serves as translation between the qualitative (String) and quantative (numerical) values for industries. An additional category named 'Multiple Industries' was added to cover cases were the highest frequency count of industry based on the relationship of the two other qualitative features, locations and normalized job titles, is shared among multiple industries and is not 0.

The above code computes the dataframe required to plot the heatmap displaying the highest frequency count of company sizes based on the relationship of the top 5 locations and normalized job titles.

The heatmap above illustrates the highest frequency count of company sizes based on the relationship of the top 5 locations and normalized job titles. For all combinations job listings exist.

The most listings for companies with 1 to 50 employees are in München (Munich) for Software Engineers, Data Scientists, Data Analysts and Project Managers. In 3 out of 5 locations are Software Engineers listed most for companies with 1 to 50 employees. For companies with 51 to 200 employees, Berlin lists the most jobs for Data Engineer, Data Analyst, Product Owner and Product Manager. In 3 out of 5 locations are Product Owners the most listed for companies with 51 to 200 employees. Data Engineers in München, Data Scientist in Hamburg and Product Manager in Frankfurt am Main listings have the highest frequency for comapnies with 201 to 500 employees. In Köln (Cologne), companies with 501 to 1000 employees searching for Product Owner and Product Manager are most frequent listed. Companies with 1001 to 5000 employees have in Köln the most job listings for Software Engineers, Data Scientists and Researchers. Companies with 5001 to 10000 employees only have in München listing for Machine Learning Engineers with the highest frequency. In Berlin and Frankfurt am Main have companies with more than 10000+ employees the most listing. Such large companies provide listings for Business Analysts for 3 out of 5 locations.

Especially the location Köln, as well as all other location have occasionally listings for individual job titles for which the highest frequency count is shared among multiple company sizes.

The above code computes the dataframe required to plot the heatmap displaying the highest frequency count of industries based on the relationship of the top 5 locations and normalized job titles.

The heatmap above illustrates the highest frequency count of the top 5 industries based on the relationship of the top 5 locations and normalized job titles.

For some combination of locations and normalized job titles no listing within the top 5 industries exists. Only in Berlin job listings for Data Analysts, Machine Learning Engineers and Researchers exist within the top 5 industries. For other locations they do not exist. For Product Owner in Köln (Cologne) and München (Munich) no listing within the top 5 industries exist.

The most listings within the Consulting industry are in Frankfurt am Main for Software Engineers, Data Scientists, Business Analysts and Product Owners. In 2 out of 5 locations are Data Scientists, Business Analysts and Project Manager for the Consulting industry the most listed.

Listings for which the Enterprise Software & Network Solutions industry has the highest frequency are rare. Only Product Owners in Berlin are most listed by this industry. It's similar for the Computer Hardware & Software industry. Just in Hamburg Product Managers are most listed for this industry.

In Hamburg the most listings for Software Engineers, Data Engineers, Data Scientists, Business Analysts and Project Managers are for the IT Services industry. Data Engineer for the IT Services industry are listed most frequent in 3 out of 5 locations.

The Internet industry has the highest frequency for the most combinations of location and job titles. The focus of the Internet industry is Berlin. The listing for all job titles besides for Product Owners have the highest frequency in Berlin. In 3 out of 5 locations are Software Engineers for the Internet industry the most listed.

Only the locations Frankfurt am Main and Köln have occasionally listings for individual job titles for which the highest frequency count is shared among multiple industries.

2.4.2. Salaries

Let's explore the reported job title salaries. We have 3 qualitative features of main interest, normalized_job_title, n_employees and industry.

Initially investigate the relationship between the qualitative features with the summed up numerical variable base_pay_count (reported salaries).

Afterwards, explore the relationship of the qualitative features with Median Low, Median and High Base Salaries.

Finally, examine the relationship of the qualitative features with the distribution of the Low, Median and High Base Salaries.

The heatmap above shows that the companies with more than 10000 employees have the most reported salaries for all top 5 normalized job tiles. It's eye-catching that Software Engineers of companies with more than 10000 employees have the most frequent salary reports. The frequency is approx. 8 times larger compared to the second and third largest frequencies of Business Analysts and Project Managers.

Without considering companies with more than 10000 employees, the job title and company size combinations with the largest amount of reported salaries are:

For some combination of normalized job titles and company sizes no reported salaries exists. Especially, for companies with 1 to 50 employees for 3 out of 5 top normalized job titles (Business Analyst, Project and Prouct Manager), no salary report exists. For next larger company size with 51 to 200 employee, Business Analysts and Project Manager salary reports do not exist, too.

In general it seems like that the more employees a company has, the more salaries for the top 5 normalized job titles are reported.

The heatmap above shows that the companies in the Internet industry has the most reported salaries for Software Engineers and Product Managers. It's eye-catching that for Software Engineers in the Internet industry a 10 times larger count of reported salaries compared to the second largest count within the top 5 normalized job titles and industries exists. That's why the Internet industry has the largest count of all industries within the top 5 normalized job titles. The second largest count is for Business Analysts in the Consulting industry. The third largest count, which is the highest count for the IT Services industry, is for Software Engineer. The Biotech & Pharmaceuticals have the most reported salaries for Researchers. Researchers presenting the highest count for the Research & Development industry, too.

For some combination of normalized job titles and industry no reported salaries exists. Only for Software Engineers and Project Manager salary reports for each of the top 5 industries exist. Especially, for Reseachers for 3 out of 5 top industries (Consulting, IT Services and Internet) no salary report exists. Only for Biotech & Pharmaceutical industry salary reports for each of the top 5 normalized job titles exist.

After the Internet industry has the IT Services industry followed by the Consulting industry the largest count of reported salaries among the top 5 normalized job titles.

The heatmap above shows that the companies with more than 10000 employees have the most reported salaries for all top 5 industries. It's eye-catching that companies with more than 10000 employees in the Internet industry have the most reported salaries by far. This count of reported salaries is approx. 3.5 times larger than the count for the second largest count of the Consulting industry.

Without considering companies with more than 10000 employees, the industry and company size combinations with the largest amount of reported salaries are:

For some combination of industries and company sizes no reported salaries exists. Especially, for companies with 1 to 50 employees for 4 out of 5 top industries no salary report exists and for the Computer Hardware & Software industry 1 reported salary exists. That's why only for the Computer Hardware & Software salary reports for all company sizes exist. For the Biotech & Pharmaceutical, Consulting, IT Services and Internet indudstry no salaries have been reported for two individual company sizes each. That's why for each those industries just reported salaries for 5 out of 7 company sizes exist.

In general it seems like that the more employees a company has, the more salaries for the top 5 industries are reported.

The heatmap above shows that the companies with more than 10000 employees have the highest median of the reported median salaries for Product Manager. It's the highest median salary considering all company sizes for the top 5 normnalized job titles.

You can find the highest median of the reported median salaries in the following for the remaining job title and company size combinations:

Larger companies have higher median salaries for Product Manager and smaller companies have higher median salaries for Software Engineers.

For some combination of normalized job titles and company sizes no reported salaries exists. Especially, for companies with 1 to 50 employees for 3 out of 5 top normalized job titles (Business Analyst, Project and Prouct Manager), no salary report exists. For next larger company size with 51 to 200 employee, Business Analysts and Project Manager salary reports do not exist, too.

The heatmap above shows that the companies in the Internet industry have the highest median of the reported median salaries for Product Manager. The second highest median salary is for Business Analysts in the Biotech & Pharmaceutical industry, which is the highest median salary in this industry among the top 5 normalized job titles. The third largest highest salary, which is the highest salary for the IT Services industry, is for Project Manager. The Consulting industry has the highest median salary for Product Managers. Project Manager have the lowest median salary in Research & Development industry, which still is the highest median slary for this industry. Researcher have a higher median salary in Research & Development compared to Biotech & Pharmaceutical industry. The Internet industry provides the highest median salaries among the top 5 industries for Software Engineers. In Consulting, IT Services and Biotech & Pharmaceutical industry Software Engineer only have an approx. 10 % lower high median salary.

For some combination of normalized job titles and industry no reported salaries exists. Only for Software Engineers and Project Manager salary reports for each of the top 5 industries exist. Especially, for Reseachers for 3 out of 5 top industries (Consulting, IT Services and Internet) no salary report exists. Only for Biotech & Pharmaceutical industry salary reports for each of the top 5 normalized job titles exist. After the Biotech & Pharmaceutical industry has the Consulting industry 4 out of 5, followed by the remaining top 5 industries with 3 out of 5 reported salaries among the top 5 normalized job titles.

The heatmap above shows that the companies with more than 10000 employees have the highest median of the reported median salaries the Consulting and Computer Hardware & Software industry. The median salaries in Consulting industry for other company sizes is maximum approx. half as high. The second highest median salary is reported for Internet industry and companies with 201 to 500 employees. Company with such a size have the highest median salary for the IT Services industry, too. For the Biotech & Pharmaceutical industry companies with 5001 to 10000 employees have the median salaries reported.

For some combination of industries and company sizes no reported salaries exists. Especially, for companies with 1 to 50 employees for 4 out of 5 top industries no salary report exists and for the Computer Hardware & Software industry has 1 reported salary exists. That's why only for the Computer Hardware & Software salary reports for all company sizes exist. For the Biotech & Pharmaceutical, Consulting, IT Services and Internet indudstry no salaries have been reported for two individual company sizes each. That's why for each those industries just reported salaries for 5 out of 7 company sizes exist.

For the top 5 normalized job titles and company size, the Low (Percentile 10), Median (Percentile 50) and High (Percentile 90) Salaries each as boxplot are illustrated above. Points lower than quantile 5 and higher than quantile 95 will be plotted outside the whisker and are treated as outliers. A log transformation for all salaries was applied, because of the right skewed distribution and large range of data determined during the univariate exploration.

The observations made based on the previously created heatmap can be confirmed by the boxplots. In addition to the median salary investigated using the heatmap, are boxplots for the low and high reported salaries shown. But there is no further insight, because of the strong relationship between low, median and high reported salaries.

For the top 5 normalized job titles and top5 industries, the Low (Percentile 10), Median (Percentile 50) and High (Percentile 90) Salaries each as boxplot are illustrated above. Points lower than quantile 5 and higher than quantile 95 will be plotted outside the whisker and are treated as outliers. A log transformation for all salaries was applied, because of the right skewed distribution and large range of data determined during the univariate exploration.

The observations made based on the previously created heatmap can be confirmed by the boxplots. In addition to the median salary investigated using the heatmap, are boxplots for the low and high reported salaries shown. But there is no further insight, because of the strong relationship between low, median and high reported salaries.

For the top 5 industries and company size, the Low (Percentile 10), Median (Percentile 50) and High (Percentile 90) Salaries each as boxplot are illustrated above. Points lower than quantile 5 and higher than quantile 95 will be plotted outside the whisker and are treated as outliers. A log transformation for all salaries was applied, because of the right skewed distribution and large range of data determined during the univariate exploration.

The observations made based on the previously created heatmap can be confirmed by the boxplots. In addition to the median salary investigated using the heatmap, are boxplots for the low and high reported salaries shown. But there is no further insight, because of the strong relationship between low, median and high reported salaries.

2.4.3. Summary

Job listings (glassdoor_clean dataset)

The features of main interest for job listings are only qualitative variables. That's why the data was grouped by top 5 normalized job titles, top 5 locations and company size. For each of those unique combinations a frequency count was determined. Following observations were determined:

For the final investigation, the data was grouped by top 5 normalized job titles, top 5 locations and top 5 industries. For each of those unique combinations a frequency count was determined. The examination lead to the following observations:

Reported salaries (salaries_clean dataset)

The relationships of each of the 3 qualitative features of main interest top 5 normalized job titles, top 5 industries and company size with the summed up number of reported salaries were explored. The exploration revealed the following observations:

Finally, the investigation of the relationship of the qualitative features of main interest top 5 normalized job titles, top 5 industries and company size with the Median Low, Median and High Base Salaries was performed. Following observations were made: